

<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
  <meta charset="utf-8" />
  
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  
  <title>CephFS Dynamic Metadata Management &mdash; Ceph Documentation</title>
  

  
  <link rel="stylesheet" href="../../_static/ceph.css" type="text/css" />
  <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" />
  <link rel="stylesheet" href="../../_static/graphviz.css" type="text/css" />
  <link rel="stylesheet" href="../../_static/css/custom.css" type="text/css" />

  
  
    <link rel="shortcut icon" href="../../_static/favicon.ico"/>
  

  
  

  

  
  <!--[if lt IE 9]>
    <script src="../../_static/js/html5shiv.min.js"></script>
  <![endif]-->
  
    
      <script type="text/javascript" id="documentation_options" data-url_root="../../" src="../../_static/documentation_options.js"></script>
        <script src="../../_static/jquery.js"></script>
        <script src="../../_static/underscore.js"></script>
        <script src="../../_static/doctools.js"></script>
    
    <script type="text/javascript" src="../../_static/js/theme.js"></script>

    
    <link rel="index" title="Index" href="../../genindex/" />
    <link rel="search" title="Search" href="../../search/" />
    <link rel="next" title="Ceph 文件系统 IO 路径" href="../cephfs-io-path/" />
    <link rel="prev" title="CephFS Distributed Metadata Cache" href="../mdcache/" /> 
</head>

<body class="wy-body-for-nav">

   
  <header class="top-bar">
    

















<div role="navigation" aria-label="breadcrumbs navigation">

  <ul class="wy-breadcrumbs">
    
      <li><a href="../../" class="icon icon-home"></a> &raquo;</li>
        
          <li><a href="../">Ceph 文件系统</a> &raquo;</li>
        
      <li>CephFS Dynamic Metadata Management</li>
    
    
      <li class="wy-breadcrumbs-aside">
        
          
            <a href="../../_sources/cephfs/dynamic-metadata-management.rst.txt" rel="nofollow"> View page source</a>
          
        
      </li>
    
  </ul>

  
  <hr/>
</div>
  </header>
  <div class="wy-grid-for-nav">
    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search"  style="background: #eee" >
          

          
            <a href="../../">
          

          
            
            <img src="../../_static/logo.png" class="logo" alt="Logo"/>
          
          </a>

          

          
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="../../search/" method="get">
    <input type="text" name="q" placeholder="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>

          
        </div>

        
        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
          
            
            
              
            
            
              <ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../../start/intro/">Ceph 简介</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../install/">安装 Ceph</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../cephadm/">Cephadm</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../rados/">Ceph 存储集群</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../">Ceph 文件系统</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="../#cephfs">CephFS 入门</a></li>
<li class="toctree-l2"><a class="reference internal" href="../#id4">管理</a></li>
<li class="toctree-l2"><a class="reference internal" href="../#id5">挂载 CephFS</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="../#id6">CephFS 内幕</a><ul class="current">
<li class="toctree-l3"><a class="reference internal" href="../mds-states/"> MDS 的各种状态</a></li>
<li class="toctree-l3"><a class="reference internal" href="../posix/"> POSIX 兼容性</a></li>
<li class="toctree-l3"><a class="reference internal" href="../mds-journaling/"> MDS 日志记录</a></li>
<li class="toctree-l3"><a class="reference internal" href="../file-layouts/"> 文件布局</a></li>
<li class="toctree-l3"><a class="reference internal" href="../mdcache/"> 分布式元数据缓存</a></li>
<li class="toctree-l3 current"><a class="current reference internal" href="#"> CephFS 内的动态元数据管理</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#dynamic-subtree-partitioning">Dynamic Subtree Partitioning</a></li>
<li class="toctree-l4"><a class="reference internal" href="#export-process-during-subtree-migration">Export Process During Subtree Migration</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="../cephfs-io-path/"> CephFS IO 路径</a></li>
<li class="toctree-l3"><a class="reference internal" href="../lazyio/"> LazyIO</a></li>
<li class="toctree-l3"><a class="reference internal" href="../dirfrags/"> 目录分片的配置</a></li>
<li class="toctree-l3"><a class="reference internal" href="../multimds/"> 多活 MDS 的配置</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../#id7">故障排除和灾难恢复</a></li>
<li class="toctree-l2"><a class="reference internal" href="../#id9">更多细节</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../../rbd/">Ceph 块设备</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../radosgw/">Ceph 对象网关</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../mgr/">Ceph 管理器守护进程</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../mgr/dashboard/">Ceph 仪表盘</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../api/">API 文档</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../architecture/">体系结构</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../dev/developer_guide/">开发者指南</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../dev/internals/">Ceph 内幕</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../governance/">项目管理</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../foundation/">Ceph 基金会</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../ceph-volume/">ceph-volume</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../releases/general/">Ceph 版本（总目录）</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../releases/">Ceph 版本（索引）</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../security/">Security</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../glossary/">Ceph 术语</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../jaegertracing/">Tracing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../translation_cn/">中文版翻译资源</a></li>
</ul>

            
          
        </div>
        
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" aria-label="top navigation">
        
          <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
          <a href="../../">Ceph</a>
        
      </nav>


      <div class="wy-nav-content">
        
        <div class="rst-content">
        
          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
            
<div id="dev-warning" class="admonition note">
  <p class="first admonition-title">Notice</p>
  <p class="last">This document is for a development version of Ceph.</p>
</div>
  <div id="docubetter" align="right" style="padding: 5px; font-weight: bold;">
    <a href="https://pad.ceph.com/p/Report_Documentation_Bugs">Report a Documentation Bug</a>
  </div>

  
  <div class="section" id="cephfs-dynamic-metadata-management">
<h1>CephFS Dynamic Metadata Management<a class="headerlink" href="#cephfs-dynamic-metadata-management" title="Permalink to this headline">¶</a></h1>
<p>Metadata operations usually take up more than 50 percent of all
file system operations. Also the metadata scales in a more complex
fashion when compared to scaling storage (which in turn scales I/O
throughput linearly). This is due to the hierarchical and
interdependent nature of the file system metadata. So in CephFS,
the metadata workload is decoupled from data workload so as to
avoid placing unnecessary strain on the RADOS cluster. The metadata
is hence handled by a cluster of Metadata Servers (MDSs).
CephFS distributes metadata across MDSs via <a class="reference external" href="https://ceph.com/wp-content/uploads/2016/08/weil-mds-sc04.pdf">Dynamic Subtree Partitioning</a>.</p>
<div class="section" id="dynamic-subtree-partitioning">
<h2>Dynamic Subtree Partitioning<a class="headerlink" href="#dynamic-subtree-partitioning" title="Permalink to this headline">¶</a></h2>
<p>In traditional subtree partitioning, subtrees of the file system
hierarchy are assigned to individual MDSs. This metadata distribution
strategy provides good hierarchical locality, linear growth of
cache and horizontal scaling across MDSs and a fairly good distribution
of metadata across MDSs.</p>
<img alt="../../_images/subtree-partitioning.svg" src="../../_images/subtree-partitioning.svg" /><p>The problem with traditional subtree partitioning is that the workload
growth by depth (across a single MDS) leads to a hotspot of activity.
This results in lack of vertical scaling and wastage of non-busy resources/MDSs.</p>
<p>This led to the adoption of a more dynamic way of handling
metadata: Dynamic Subtree Partitioning, where load intensive portions
of the directory hierarchy from busy MDSs are migrated to non busy MDSs.</p>
<p>This strategy ensures that activity hotspots are relieved as they
appear and so leads to vertical scaling of the metadata workload in
addition to horizontal scaling.</p>
</div>
<div class="section" id="export-process-during-subtree-migration">
<h2>Export Process During Subtree Migration<a class="headerlink" href="#export-process-during-subtree-migration" title="Permalink to this headline">¶</a></h2>
<p>Once the exporter verifies that the subtree is permissible to be exported
(Non degraded cluster, non-frozen subtree root), the subtree root
directory is temporarily auth pinned, the subtree freeze is initiated,
and the exporter is committed to the subtree migration, barring an
intervening failure of the importer or itself.</p>
<p>The MExportDiscover message is exchanged to ensure that the inode for the
base directory being exported is open on the destination node. It is
auth pinned by the importer to prevent it from being trimmed. This occurs
before the exporter completes the freeze of the subtree to ensure that
the importer is able to replicate the necessary metadata. When the
exporter receives the MDiscoverAck, it allows the freeze to proceed by
removing its temporary auth pin.</p>
<p>A warning stage occurs only if the base subtree directory is open by
nodes other than the importer and exporter. If it is not, then this
implies that no metadata within or nested beneath the subtree is
replicated by any node other than the importer and exporter. If it is,
then an MExportWarning message informs any bystanders that the
authority for the region is temporarily ambiguous, and lists both the
exporter and importer as authoritative MDS nodes. In particular,
bystanders who are trimming items from their cache must send
MCacheExpire messages to both the old and new authorities. This is
necessary to ensure that the surviving authority reliably receives all
expirations even if the importer or exporter fails. While the subtree
is frozen (on both the importer and exporter), expirations will not be
immediately processed; instead, they will be queued until the region
is unfrozen and it can be determined that the node is or is not
authoritative.</p>
<p>The exporter then packages an MExport message containing all metadata
of the subtree and flags the objects as non-authoritative. The MExport message sends
the actual subtree metadata to the importer. Upon receipt, the
importer inserts the data into its cache, marks all objects as
authoritative, and logs a copy of all metadata in an EImportStart
journal message. Once that has safely flushed, it replies with an
MExportAck. The exporter can now log an EExport journal entry, which
ultimately specifies that the export was a success. In the presence
of failures, it is the existence of the EExport entry only that
disambiguates authority during recovery.</p>
<p>Once logged, the exporter will send an MExportNotify to any
bystanders, informing them that the authority is no longer ambiguous
and cache expirations should be sent only to the new authority (the
importer). Once these are acknowledged back to the exporter,
implicitly flushing the bystander to exporter message streams of any
stray expiration notices, the exporter unfreezes the subtree, cleans
up its migration-related state, and sends a final MExportFinish to the
importer. Upon receipt, the importer logs an EImportFinish(true)
(noting locally that the export was indeed a success), unfreezes its
subtree, processes any queued cache expierations, and cleans up its
state.</p>
</div>
</div>



           </div>
           
          </div>
          <footer>
    <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
        <a href="../cephfs-io-path/" class="btn btn-neutral float-right" title="Ceph 文件系统 IO 路径" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
        <a href="../mdcache/" class="btn btn-neutral float-left" title="CephFS Distributed Metadata Cache" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
    </div>

  <hr/>

  <div role="contentinfo">
    <p>
        &#169; Copyright 2016, Ceph authors and contributors. Licensed under Creative Commons Attribution Share Alike 3.0 (CC-BY-SA-3.0).

    </p>
  </div> 

</footer>
        </div>
      </div>

    </section>

  </div>
  

  <script type="text/javascript">
      jQuery(function () {
          SphinxRtdTheme.Navigation.enable(true);
      });
  </script>

  
  
    
   

</body>
</html>